home *** CD-ROM | disk | FTP | other *** search
Text File | 1994-12-08 | 47.2 KB | 1,298 lines | [TEXT/R*ch] |
- C.S.M.P. Digest Wed, 28 Oct 92 Volume 1 : Issue 198
-
- Today's Topics:
-
- Better random number than Random()?
- BlockZero.c snippet
-
-
-
- The Comp.Sys.Mac.Programmer Digest is moderated by Michael A. Kelly.
-
- The digest is a collection of article threads from the internet newsgroup
- comp.sys.mac.programmer. It is designed for people who read c.s.m.p. semi-
- regularly and want an archive of the discussions. If you don't know what a
- newsgroup is, you probably don't have access to it. Ask your systems
- administrator(s) for details. (This means you can't post questions to the
- digest.)
-
- Each issue of the digest contains one or more sets of articles (called
- threads), with each set corresponding to a 'discussion' of a particular
- subject. The articles are not edited; all articles included in this digest
- are in their original posted form (as received by our news server at
- cs.uoregon.edu). Article threads are not added to the digest until the last
- article added to the thread is at least one month old (this is to ensure that
- the thread is dead before adding it to the digest). Article threads that
- consist of only one message are generally not included in the digest.
-
- The entire digest is available for anonymous ftp from ftp.cs.uoregon.edu
- [128.223.8.8] in the directory /pub/mac/csmp-digest. Be sure to read the
- file /pub/mac/csmp-digest/README before downloading any files. The most
- recent issues are available from sumex-aim.stanford.edu [36.44.0.6] in the
- directory /info-mac/digest/csmp. If you don't have ftp capability, the sumex
- archive has a mail server; send a message with the text '$MACarch help' (no
- quotes) to LISTSERV@ricevm1.rice.edu for more information.
-
- The digest is also available via email. Just send a note saying that you
- want to be on the digest mailing list to mkelly@cs.uoregon.edu, and you will
- automatically receive each new issue as it is created. Sorry, back issues
- are not available through the mailing list.
-
- Send administrative mail to mkelly@cs.uoregon.edu.
-
-
- -------------------------------------------------------
-
- From: hd12@ellis.uchicago.edu (hui dong)
- Subject: Better random number than Random()?
- Date: 18 Sep 92 10:54:10 GMT
- Organization: University of Chicago Computing Organizations
-
- (I did check FAQ before I post this.)
-
- I need a fast (speed is very important) random number generator. Right now
- I am using Random(), but it seems to me the distribution is not uniform
- at all. Is there a better generator? Will it be faster to use my own
- function than to use toolbox routine?
- Thanks for the help.
-
- +++++++++++++++++++++++++++
-
- From: fry@tara.harvard.edu (David Fry)
- Date: 19 Sep 92 05:20:38 GMT
- Organization: Harvard Math Department
-
- In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
- >(I did check FAQ before I post this.)
- >
- >I need a fast (speed is very important) random number generator. Right now
- >I am using Random(), but it seems to me the distribution is not uniform
- >at all. Is there a better generator? Will it be faster to use my own
- >function than to use toolbox routine?
- >Thanks for the help.
-
- This probably should be in the FAQ list because it comes up all the
- time (and it seems I'm always the one to answer it). Random() is a
- quite good random number generator, and it is pretty fast. It will
- almost certainly be faster than one you write yourself, provided the
- one you write yourself is as good.
-
- Are you changing the randSeed seed variable before calling it the
- first time? You can use the result of a GetDateTime() call as the
- initial seed.
-
- Why do you think it isn't uniform? Are you doing anything strange
- with the results of Random(), like using a "mod" to get a smaller
- random number? This decreases the quality of your random numbers,
- although not as much with Random() as with some other RNG's.
-
- David Fry fry@math.harvard.edu
- Division of Applied Sciences fry@huma1.bitnet
- Harvard University ...!harvard!huma1!fry
- Cambridge, MA 02138
-
- +++++++++++++++++++++++++++
-
- From: jvp@vlsi1.dell.com (Jim Van Peursem)
- Date: 19 Sep 92 15:56:43 GMT
- Organization: Iowa State University, Ames IA
-
- In <1992Sep19.012039.15704@husc3.harvard.edu> fry@tara.harvard.edu (David Fry) writes:
-
- >>I need a fast (speed is very important) random number generator. Right now
- >>I am using Random(), but it seems to me the distribution is not uniform
- >>at all. Is there a better generator? Will it be faster to use my own
- >>function than to use toolbox routine?
- >>Thanks for the help.
-
- >This probably should be in the FAQ list because it comes up all the
- >time (and it seems I'm always the one to answer it). Random() is a
- >quite good random number generator, and it is pretty fast. It will
- >almost certainly be faster than one you write yourself, provided the
- >one you write yourself is as good.
-
- What kind of RNG is Random()? Isn't it a linear congruential?
- If it is, better quality RNG's exist such as Lagged Fibbonacci.
-
- +---------------------------------------------------------------+
- | Jim Van Peursem - Ph.D. Student - Ham Radio -> KE0PH |
- | System Administrator - Digital Systems Design Lab |
- | Department of Electrical Engineering and Computer Engineering |
- | Iowa State University - Ames, IA 50011 |
- | internet - jvp@cpre1.ee.iastate.edu |
- +---------------------------------------------------------------+
-
- +++++++++++++++++++++++++++
-
- From: sw@network-analysis-ltd.co.uk (Sak Wathanasin)
- Date: 19 Sep 92 08:45:28 GMT
- Organization: Network Analysis Ltd
-
-
- In article <1992Sep18.105410.24438@midway.uchicago.edu> (comp.sys.mac.programmer), hd12@ellis.uchicago.edu (hui dong) writes:
-
- > (I did check FAQ before I post this.)
-
- Well, it ought to be in the FAQ, since this comes up every few months.
- It is covered in the Usenet Mac Programmer's Guide.
-
- > I need a fast (speed is very important) random number generator. Right now
- > I am using Random(), but it seems to me the distribution is not uniform
- > at all. Is there a better generator? Will it be faster to use my own
- > function than to use toolbox routine?
-
- Ignore the 16-bit result returned by Random(), and use the full 32-bit
- value that is in randSeed. Details are in the UMPG. The Random() in rom
- is the same as the Park & Miller "minimal standard RNG" described in a
- CACM paper, and guards against overflows during intermediate calculations.
- You will be hard pushed to write a faster version in C or Pascal, and will
- have to use assembler to beat it.
-
-
- Sak Wathanasin
- Network Analysis Limited
- 178 Wainbody Ave South, Coventry CV3 6BX, UK
-
- uucp: ...!uknet!nan!sw Phone: (+44) 203 419996
- AppleLink: NAN.LTD Internet: sw@network-analysis-ltd.co.uk
-
- +++++++++++++++++++++++++++
-
- From: fry@tara.harvard.edu (David Fry)
- Date: 19 Sep 92 13:42:12 EDT
- Organization: Harvard Math Department
-
- In article <jvp.716918203@vlsi1> jvp@vlsi1.dell.com (Jim Van Peursem) writes:
- > What kind of RNG is Random()? Isn't it a linear congruential?
- >If it is, better quality RNG's exist such as Lagged Fibbonacci.
-
- Yes, it's linear congruential. Certainly better RNG's exist, but for
- more post purposes, Random() is more than good enough. It follows the
- so-called "minimal standard" RNG proposed in "Random Number
- Generators: Good Ones are Hard to Find" by Park and Miller, in
- Communications of the ACM, October 1988. This article details the
- history of faulty RNGs and explains why this is a fast and good one.
-
- As someone else pointed out, to really get the best quality from
- Random(), you should ignore there result of the function (which is
- 16-bits), and use the value in randSeed (which is 32-bits) after
- calling Random(). In that case you have exactly the RGN described in
- the CACM paper. I think even using the 16-bit number is good
- enough for most purposes because the lower 16-bits of the RNG are
- sufficiently random. Still, I admit to using the full 32-bits in my
- work :-).
-
- David Fry fry@math.harvard.edu
- Division of Applied Sciences fry@huma1.bitnet
- Harvard University ...!harvard!huma1!fry
- Cambridge, MA 02138
-
- +++++++++++++++++++++++++++
-
- From: Bruce.Hoult@bbs.actrix.gen.nz
- Organization: Actrix Information Exchange
- Date: Sun, 20 Sep 1992 04:06:49 GMT
-
- In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
- > (I did check FAQ before I post this.)
- >
- > I need a fast (speed is very important) random number generator. Right now
- > I am using Random(), but it seems to me the distribution is not uniform
- > at all. Is there a better generator? Will it be faster to use my own
- > function than to use toolbox routine?
- > Thanks for the help.
-
- The toolbox Random() functions is perfectly OK if you use the *seed*
- as your random value and ignore the function result. OTOH, it's not
- exactly fast because of the Trap call overhead.
-
- I've included my own random number generator, which requires a 68020
- or better.
-
- I've also included my attempt at a random generator that produces
- "normally" distributed numbers. It's based on the Central Limits
- Theorem that says the sum of several independant random variables is
- approximately normally distributed, no matter what distribution the
- individual variables have. I found that for my purposes the sum of
- ten uniformally distributed variables was good enough.
-
- The only mildly clever thing about it is that I do a little parallel
- computing by working out the random numbers in the CPU and then adding
- them (to get a greater then 32 bit result) in the coprocessor.
-
- Comment is invited on whether this is a good way to generate a normal
- distribution, or not.
-
- - -- Bruce
-
- - ---------------
-
- procedure MyRandom(var seed:longint);
- inline
- $225F, {movea.l (a7)+,a1}
- $2011, {move.l (a1),d0}
- $4C3C, $0401, $0000, $41A7, {mulu.l #41a7,d1:d0}
- $4C7C, $0401, $7FFF, $FFFF, {divu.l #7fffffff,d1:d0}
- $2281; {move.l d1,(a1)}
-
- procedure MyNormal(
- numToAdd:integer;
- var seed:longint;
- var result:extended
- ); external;
-
- - ------------------------------------------------------
-
- machine mc68020
- mc68881
-
- MyNormal proc export
- link a6,#0
- fmove.w #0,fp0
- move.l 8(a6),a1 ;pointer to result
- move.l 12(a6),a0 ;pointer to seed
- move.l (a0),d0
- move.w 16(a6),d2 ;number of loops
- bra.b loopend
- loop mulu.l #16807,d1:d0 ;7^5
- divu.l #$7fffffff,d1:d0
- fadd.l d1,fp0
- move.l d1,d0
- loopend dbra d2,loop
- move.l d0,(a0) ;update the seed
- fmove.x fp0,(a1) ;return the result
- unlk a6
- move.l (a7)+,a0
- add.l #10,a7
- jmp (a0)
- endproc
- end
- - --
- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
- BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
- "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
- hard disk that fits in your pocket!" "Great! Is it PC compatible?"
-
- +++++++++++++++++++++++++++
-
- From: ksand@apple.com (Kent Sandvik)
- Date: 20 Sep 92 19:06:23 GMT
- Organization: Apple
-
- In article <1992Sep19.012039.15704@husc3.harvard.edu>, fry@tara.harvard.edu
- (David Fry) wrote:
- > In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes:
-
- > >I need a fast (speed is very important) random number generator. Right now
- > >I am using Random(), but it seems to me the distribution is not uniform
- > >at all. Is there a better generator? Will it be faster to use my own
- > >function than to use toolbox routine?
- > >Thanks for the help.
-
- > This probably should be in the FAQ list because it comes up all the
- > time (and it seems I'm always the one to answer it). Random() is a
- > quite good random number generator, and it is pretty fast. It will
- > almost certainly be faster than one you write yourself, provided the
- > one you write yourself is as good.
-
- True, at least the Usenet Mac Programmer Guide has a lot of information
- about alternate random generator algorithms and code. I've done some
- tests myself, and it wasn't really justified to write alternate random
- generators, for instance I didn't see much performance improvements
- by inlining hex code instead of calling a trap. Check the TRandom C++
- class on the developer CD for my silly tests.
-
- Cheers,
- Kent/DTS
- "You should really just relax" -MST3K
- - -------------------
- Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com)
- DISCLAIMER: Private activities on the Net.
-
- +++++++++++++++++++++++++++
-
- From: Bruce.Hoult@bbs.actrix.gen.nz
- Date: 21 Sep 92 01:34:32 GMT
- Organization: Actrix Information Exchange
-
- In article <ksand-200992120624@wintermute.apple.com> ksand@apple.com (Kent Sandvik) writes:
-
- > I've done some
- > tests myself, and it wasn't really justified to write alternate random
- > generators, for instance I didn't see much performance improvements
- > by inlining hex code instead of calling a trap. Check the TRandom C++
- > class on the developer CD for my silly tests.
-
- What machine were you testing on?
-
- Whn I wrote the code I posted last night I was on a IIcx and there was
- a *massive* difference in trap overhead vs procedure call overhead.
- The difference seems to have gotten much smaller on the '040 -- in a
- test a while ago on the Q700 I found that a function call and return
- with three longword parameters and a trivial body (like SetPt) took just
- over 1 uS while a similar trap call took 3 uS. (I hope i've
- remembered those right :-)
-
- - --
- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
- BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
- "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
- hard disk that fits in your pocket!" "Great! Is it PC compatible?"
-
- +++++++++++++++++++++++++++
-
- From: ksand@apple.com (Kent Sandvik)
- Date: 22 Sep 92 02:02:51 GMT
- Organization: Apple
-
- In article <1992Sep21.013432.18668@actrix.gen.nz>,
- Bruce.Hoult@bbs.actrix.gen.nz wrote:
- > Whn I wrote the code I posted last night I was on a IIcx and there was
- > a *massive* difference in trap overhead vs procedure call overhead.
- > The difference seems to have gotten much smaller on the '040 -- in a
- > test a while ago on the Q700 I found that a function call and return
- > with three longword parameters and a trivial body (like SetPt) took just
- > over 1 uS while a similar trap call took 3 uS. (I hope i've
- > remembered those right :-)
-
- When I think about it my tests were on a Quadra 900. I will do
- some more additional tests on other machines later.
-
- Kent
- "You should really just relax" -MST3K
- - -------------------
- Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com)
- DISCLAIMER: Private activities on the Net.
-
- ---------------------------
-
- From: mgleason@cse.unl.edu (Mike Gleason)
- Subject: BlockZero.c snippet
- Date: 15 Sep 92 18:50:19 GMT
- Organization: NCEMRSoft
-
- I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
- so here's a freebie function that clears an area of memory (which can start
- on an odd boundary). It's much faster than using memset(ptr,0,n) because
- this function uses move.l's for the majority of the block. It compiles
- under Think C, but the asm {} blocks may break other compilers.
-
- I use this function all the time, especially when dealing with parameter
- blocks. e.g.:
-
- #define ZERO(a) BlockZero(&(a), sizeof (a))
-
- {
- HParamBlockRec pb;
-
- ZERO(pb);
- }
-
-
- Code follows. Further optimizations, your cool code snippets, bug fixes
- <gulp> welcome....
-
- - ----------------
-
- void BlockZero(void *area, register unsigned long n);
-
- void BlockZero(void *area, register unsigned long n)
- {
- register char *p = (char *)area;
- register unsigned long fours;
- register unsigned long remainder, zero;
-
- if (n) {
- if ((long) p & 01) {
- /* Ptr on an odd boundary... */
- asm {
- clr.b (p)+
- }
- --n;
- /*
- * Ptr is now on an even boundary,
- * and it's first byte is zeroed.
- */
- }
-
- fours = n / 4L; /* Can be zero. */
-
- /* Do the meat of the block using longwords for speed. */
-
- asm {
- clr.l zero
- bra @2
- @1: move.l zero, (p)+
- @2: subq.l #1, fours
- tst.l fours
- bge @1
- }
-
- /*
- * If the size of the block isn't divisible by 4, we'll
- * have to do the remaining bytes (up to 3) by hand.
- */
-
- remainder = n & 03; /* same as remainder = n % 4. */
-
- asm {
- bra @4
- @3: clr.b (p)+
- @4: dbra remainder, @3
- }
- }
- } /* BlockZero */
-
- /* eof BlockZero.c */
- - --
- - --mg mgleason@cse.unl.edu
-
- +++++++++++++++++++++++++++
-
- From: keith@taligent.com (Keith Rollin)
- Date: 16 Sep 92 17:55:14 GMT
- Organization: Taligent
-
- In article <1992Sep15.185019.22483@unlinfo.unl.edu>, mgleason@cse.unl.edu
- (Mike Gleason) wrote:
- >
- > I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
- > so here's a freebie function that clears an area of memory (which can start
- > on an odd boundary). It's much faster than using memset(ptr,0,n) because
- > this function uses move.l's for the majority of the block. It compiles
- > under Think C, but the asm {} blocks may break other compilers.
- >
- > [ code deleted ]
- >
-
- True, not only will this not work under MPW (which doesn't have inline
- asm), but MPW's memset library routine is better than the one you post.
- Specifically, for large enough blocks, it uses a series of 4 MOVE.L
- instructions in a row for setting the block. On the other hand, I don't
- know what the timing differences between MOVE.L and CLR.L are, so perhaps a
- specialized bzero() based on MPW's algorithm would be nice.
-
- - -----
- Keith Rollin
- Phantom Programmer
- Taligent, Inc.
-
- +++++++++++++++++++++++++++
-
- From: nagel@Cigna.COM (Mark Nagel)
- Date: 16 Sep 92 23:09:16 GMT
- Organization: CIGNA FIRST, Irvine, CA
-
- In <keith-160992104732@kip-58.taligent.com> keith@taligent.com (Keith Rollin) writes:
-
- >instructions in a row for setting the block. On the other hand, I don't
- >know what the timing differences between MOVE.L and CLR.L are, so perhaps a
- >specialized bzero() based on MPW's algorithm would be nice.
-
- I'm curious about this. I picked up a Motorola 68K family processor
- manual, but it doesn't have instruction timing info. I have noticed
- that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
- D0. I gather from this that MOVEQ.L is faster, but I'd still like
- to see timings. Anyone know a good assembler book for the 68K that
- has info like that? Something along the lines of "Zen and the Art
- of Assembly Language Programming" would be wonderful.
-
- Mark
- - --
- Mark Nagel <nagel@cigna.com> | DISCLAIMER: Any resemblence of these
- Network Administrator | opinions to those of CIGNA are purely
- CIGNA/FIRST | coincidental.
- (Don't start vast projects with half-vast ideas)
-
- +++++++++++++++++++++++++++
-
- From: gurgle@netcom.com (Pete Gontier)
- Date: 17 Sep 92 02:47:01 GMT
- Organization: cellular
-
- nagel@Cigna.COM (Mark Nagel) writes:
-
- >I'm curious about this. I picked up a Motorola 68K family processor
- >manual, but it doesn't have instruction timing info. I have noticed
- >that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
- >D0. I gather from this that MOVEQ.L is faster, but I'd still like
- >to see timings.
-
- Notice, though, that MOVEQ has very limited destination operand types.
- If I remember correctly, registers only. So that renders its use in
- the context of bzero( ) pretty dubious. Me, I just call NewHandleClear
- in the first place and assume Apple's Cray can write better 68000
- than me. :-)
- - --
- Pete Gontier // EC Technology // gurgle@netcom.com
-
- +++++++++++++++++++++++++++
-
- From: Bruce.Hoult@bbs.actrix.gen.nz
- Date: 17 Sep 92 05:42:50 GMT
- Organization: Actrix Information Exchange
-
- In article <1992Sep15.185019.22483@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes:
- > I think J. McCarthy & M. Mora started a good idea by posting code snippets here,
- > so here's a freebie function that clears an area of memory (which can start
- > on an odd boundary). It's much faster than using memset(ptr,0,n) because
- > this function uses move.l's for the majority of the block. It compiles
- > under Think C, but the asm {} blocks may break other compilers.
-
- You should add in a possible 2 byte move at the start and end to align
- the pointer on a longword boundary. This will increase speed on the
- 32-bit bus machines.
-
- You should unroll the main loop to do several move.l's each time around.
-
- You should use dbra for the loop control (only two bytes longer than your
- version and much faster)...
-
- bra @3
- @1 swap fours
- @2 move.l zero,(p)+ # or several of them..
- @3 dbra fours, @2
- swap fours
- dbra fours, @1
-
- - --
- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
- BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
- "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
- hard disk that fits in your pocket!" "Great! Is it PC compatible?"
-
- +++++++++++++++++++++++++++
-
- From: Bruce.Hoult@bbs.actrix.gen.nz
- Date: 17 Sep 92 10:08:16 GMT
- Organization: Actrix Information Exchange
-
- In article <1992Sep16.230916.18166@Cigna.COM> nagel@Cigna.COM (Mark Nagel) writes:
-
- > I'm curious about this. I picked up a Motorola 68K family processor
- > manual, but it doesn't have instruction timing info. I have noticed
- > that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L
- > D0. I gather from this that MOVEQ.L is faster, but I'd still like
- > to see timings. Anyone know a good assembler book for the 68K that
- > has info like that? Something along the lines of "Zen and the Art
- > of Assembly Language Programming" would be wonderful.
-
- My official Motorola 68000 (M68000UM/AD REV 4 1986) and 68020
- (MC68020UM/AD REV 1 1985) books both have tables of timing
- information, but the 68020 one is full of disclaimers about how
- complex figuring timings is on modern architectures, and in fact I
- think motorola quit publishing instruction timing tables. You've
- really got to try *your* instruction sequence to find out, unless
- you've got really intimate knowledge of the undocumented internals of
- the pipeline etc.
- - --
- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
- BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
- "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
- hard disk that fits in your pocket!" "Great! Is it PC compatible?"
-
- +++++++++++++++++++++++++++
-
- From: mgleason@cse.unl.edu (Mike Gleason)
- Date: 17 Sep 92 18:29:11 GMT
- Organization: NCEMRSoft
-
- Bruce.Hoult@bbs.actrix.gen.nz writes:
-
- >You should unroll the main loop to do several move.l's each time around.
-
- >You should use dbra for the loop control (only two bytes longer than your
- >version and much faster)...
-
- I tried this for the main block, but if I recall, dbra's loop counter
- was acting like a 16-bit word instead of a nice big longword.
-
- Some other folks have sent me some tips, so I will tweak it more using
- these tricks that aren't covered in my VAX Assembler book. I have attempted
- to get my hands on a 68000 assembly book for the mac, but so far all the
- books mentioned in the FAQ (and the one E. Wylie suggested recently) are
- either out of print or my book store can't order them :-(
-
- - --
- - --mg mgleason@cse.unl.edu
-
- +++++++++++++++++++++++++++
-
- From: mgleason@cse.unl.edu (Mike Gleason)
- Date: 18 Sep 92 05:27:49 GMT
- Organization: NCEMRSoft
-
- Here's a much improved BlockZero function. This one now long word aligns,
- and instead of using one small move.l loop to clear 4 bytes at a time, this
- version clears 144 bytes per loop iteration, then 16 bytes at a time when
- the block has less than 144 dirty bytes left, then 1 byte at a time for any
- remaining bytes. In other words, it performs OK on small blocks, pretty
- good on medium size blocks, and excellent on large blocks. Of course there
- aren't many times where one would need to zero a big block, since most big
- blocks are pointers that can be cleared the same time they are allocated,
- but it still makes a good general-purpose bzero.
-
- Thanks to everyone who responded, especially John Bruner and Michael Hecht.
- I'm sure there are more optimizations that could be done, but I'll let the
- Gods of the 68000 worry about that :-)
-
- Here it is, it requires Think C to compile.
-
- /* BlockZero.c */
-
- /* Handy macro: */
- #define ZERO(block,size) BlockZero(&(block), (long) (size))
-
- void BlockZero(void *area, long n);
-
- #define PTR a0 /* Our copy of 'area.' */
- #define COUNT d0 /* Our copy of 'n.' */
-
- /*
- * Using registers d1 - d7, and a1 - a5 to hold zeroes for use
- * with MOVEM. Could use a6 also if you can trick the inline
- * assembler to not use LINK/UNLK.
- */
- #define nAddrRegsUsed 5
- #define nDataRegsUsed 7
-
- /*
- * One MOVEM instruction will clear this many bytes at a time.
- */
- #define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long))
-
- /*
- * This is how many MOVEM instructions you want in your loop. For each
- * extra one you add, then large blocks will clear faster; however
- * smaller blocks will be slower because it will use the intermediate
- * loop, where only 16 bytes are cleared at a time.
- */
- #define UnrolledLoops 3
-
- /*
- * For each UnrolledLoop, do one of these.
- */
- #define Loop movem.l d1-d7/a1-a5, -(PTR)
-
- /*
- * This is how many bytes are cleared by all of your MOVEMs before
- * we have to decrement the count and start another loop iteration.
- */
- #define TotalSize (ZSize * UnrolledLoops)
-
- void BlockZero(void *area, long n)
- {
- asm {
- movem.l d0-d7/a0-a5, -(sp)
-
- move.l n, COUNT ; Could use index from the SP
- movea.l area, PTR ; for these instead of a6.
-
- add.l COUNT, PTR ; Start at end of block,
- ; and go backwards.
-
- move.w PTR, d1 ; Find (area + n) % 4.
- andi.w #3, d1
-
- sub.l d1, COUNT ; We will byte-zero this
- ; many bytes to longword align,
- ; so subtract this from the
- ; total.
-
- moveq.l #0, d2
-
- bra @2 ; Zero the last 3 or so
- @1: move.b d2, -(PTR) ; bytes to align PTR on a
- @2: dbra d1, @1 ; longword boundary.
-
- ; Clear out as many registers
- ; as possible to use with MOVEM.
-
- move.l d2, d1 ; Don't need d1 for scratch.
- move.l d2, d3
- move.l d2, d4
-
- cmpi.l #TotalSize, COUNT
- bls @6 ; If it is a smaller block, then
- ; jump to the section that clears
- ; 16 at a time.
-
- ; Otherwise we'll need to clear
- ; out the rest of these registers
- ; to use in a huge MOVEM.
- move.l d2, d5
- move.l d2, d6
- move.l d2, d7
- movea.l d2, a1
- movea.l a1, a2
- movea.l a1, a3
- movea.l a1, a4
- movea.l a1, a5
-
- ; Now zero as much as we can
- ; in blocks of 144 (3 movem's
- ; of 48 bytes each).
-
- bra @4
- @3: Loop ; Add more Loops here if you
- Loop ; like, but adjust the #defines
- Loop ; above.
- @4: subi.l #TotalSize, COUNT
- bge @3
-
- ; The remaining block is now
- ; less than TotalSize, so now
- ; zero as much as we can
- ; in blocks of 16.
-
- addi.l #TotalSize, COUNT
-
- bra @6
- @5: movem.l d1-d4, -(PTR)
- @6: subi.l #16, COUNT
- bge @5
- ; The remaining block is now
- ; less than 16, so now
- ; use byte-zeroing.
-
- addi.l #16, COUNT
- bra @8
- @7: move.b d2, -(PTR)
- @8: dbra COUNT, @7
- ; Restore nuked registers.
- movem.l (sp)+,d0-d7/a0-a5
- }
- } /* BlockZero */
-
- /* eof BlockZero.c */
- - --
- ______________________________________________________________________________
- mike gleason mgleason@cse.unl.edu NCEMRSoft, baby!
-
- +++++++++++++++++++++++++++
-
- From: d88-jwa@musta.nada.kth.se (Jon Wtte)
- Date: 18 Sep 92 09:59:59 GMT
- Organization: Royal Institute of Technology, Stockholm, Sweden
-
- > mgleason@cse.unl.edu (Mike Gleason) writes:
-
- I tried this for the main block, but if I recall, dbra's loop counter
- was acting like a 16-bit word instead of a nice big longword.
-
- Well, using the SWAP function you can use DBRA as a 32-bit counter.
- Still beats the MS-DOS programming model :-)
-
- - --
- Jon W{tte, h+@nada.kth.se, Sweden, Phone +46-8-107069
-
- Help eradicate FIDO-Net <-> Usenet gateways in our time!
-
- +++++++++++++++++++++++++++
-
- From: Bruce.Hoult@bbs.actrix.gen.nz
- Date: 18 Sep 92 07:55:21 GMT
- Organization: Actrix Information Exchange
-
- In article <1992Sep17.182911.27693@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes:
- > Bruce.Hoult@bbs.actrix.gen.nz writes:
- >
- > >You should use dbra for the loop control (only two bytes longer than your
- > >version and much faster)...
- >
- > I tried this for the main block, but if I recall, dbra's loop counter
- > was acting like a 16-bit word instead of a nice big longword.
-
- Correct -- that's why the code I showed used *two* DBRA's in nested
- loops, with the tricky little SWAPs to get the high word of the count
- register into the low word for the outer DBRA and back again. It's a
- little bit ugly but correctly handles a 32-bit count, is only 2 bytes
- bigger than what you were doing, and the extra SWAPs only get executed
- once every 65536 loops so their overhead is almost zero. All of which
- make this method (thanks to Lawrence for originally pointing the
- method out!) overall tidier than the obvious nested DBRA method of
- putting 16-bit counts in two different registers -- which has a little
- less overhead every 65536 loops, but uses at minimum an extra MOVE.L
- and SWAP in the loop prologue and uses an extra precious register,
- for the same size code.
-
- Isn't it amazing how much fun saving a couple of bytes or cycles can
- be? ;-)
- - --
- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
- BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
- "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB
- hard disk that fits in your pocket!" "Great! Is it PC compatible?"
-
- +++++++++++++++++++++++++++
-
- From: wombat@claris.com (Scott Lindsey)
- Date: 18 Sep 92 09:18:30 GMT
- Organization: Claris Corporation, Vancouver WA
-
- In article <keith-160992104732@kip-58.taligent.com>, keith@taligent.com (Keith Rollin) writes:
- >
- > On the other hand, I don't
- > know what the timing differences between MOVE.L and CLR.L are, so perhaps a
-
- Well, according to the 5th edition of the M68000 Programmer's Reference Manual
- from Motorola,
-
- MOVE.L #<data>,(An)+ 20(3/2)
- CLR.L (An)+ 24(2/4) + 16(4/0)
-
- where the first number is clock periods, and the 2 parenthesized are bus read &
- write cycles. the 16(4/0) is the time required for the (An)+ effective address
- calculation. So on a 68000, the MOVE beats the CLR.
-
- The manual also includes timings for the 68010. The MOVE timing is same,
- but the CLR timing is 12(1/2) for (An)+ with no additional time for address
- calculation. So CLR is the winner. I don't have timings for 020, 030, 040's,
- but one would expect them to be more like the 010 than the 000. Note that the
- 010 also has a "loop mode" where opcode fetches are suppressed, speeding up
- the loop even more.
-
- So I would guess that if you're optimizing for the 68000 (which needs it the
- most) then use MOVE, but use CLR if you're optimizing for > 68000.
-
-
- - --
- Scott Lindsey <wombat@claris.com>
- QUIT
-
- Scott Lindsey <wombat@claris.com>
-
- +++++++++++++++++++++++++++
-
- From: peterc@cubetech.com (Peter Creath)
- Date: Thu, 17 Sep 92 22:25:37 CDT
- Organization: Cube Technologies
-
- Here's my optimization on Mike Gleason's BlockZero.c snippet:
-
- (I'm gonna remove the comments within the code and explain it here...)
-
- This is shorter, primarily because the code itself (not the declarations)
- is entirely in assembly. Therefore, it can be used by any Pascal
- or C compiler with an inline assembler.
-
- This is also slightly faster and smaller, since I killed possible
- inefficiencies in the compiled code. The original routine was designed
- for speed, not for size, so I added an option that compiles a MUCH
- smaller (but slower) version:
-
- #define FAST /* leave undefined for the SMALL version */
- #define ZERO(a) BlockZero(&(a), sizeof (a))
-
- {
- HParamBlockRec pb;
- ZERO(pb);
- }
-
- void BlockZero(void *area, register unsigned long n);
-
- void BlockZero(void *area, register unsigned long n)
- {
- register char *p = (char *)area;
- register unsigned long temp;
- register unsigned long zero;
-
- #ifdef FAST
- asm {
- tst.l n
- beq.s @skipit /* only if len > 0 */
- move.w p,temp
- andi.w #0x01,temp /* if odd offset, */
- beq.s @skipeven
- clr.b (p)+ /* clear odd byte and */
- subq.l #0x01,n /* make offset even */
- @skipeven:
- move.l n,temp
- asr.l #0x02,temp /* len/4 (# of longs in area) */
- bra.s @fourloop
- @clrloop: /* clear out longs speedily */
- clr.l (p)+
- @fourloop:
- dbf temp,@clrloop
- @nofours:
- move.b n,temp
- andi.b #0x03,temp /* check remainder (len % 4) */
- bra.s @oddloop
- @clearodd:
- clr.b (p)+ /* clear remaining bytes */
- @oddloop:
- dbf temp, @clearodd
- @skipit:
- }
- #else /* if SMALL */
- asm {
- move.l n,temp
- bra.s @loopback
- @loop:
- clr.b (p)+
- @loopback:
- dbf temp,@loop
- }
- #endif
- } /* BlockZero */
-
- /* eof BlockZero.c */
-
-
- - ----------------------------------------------------------------------------
- Peter Creath "When I was a boy I was told that anybody could
- peterc@cubetech.com become president; I'm beginning to believe it."
- -- Clarence Darrow
-
- +++++++++++++++++++++++++++
-
- From: mgleason@cse.unl.edu (Mike Gleason)
- Date: 21 Sep 92 03:10:42 GMT
- Organization: NCEMRSoft
-
- peterc@cubetech.com (Peter Creath) writes:
-
- >Lines: 75
-
- >Here's my optimization on Mike Gleason's [original] BlockZero.c snippet:
-
- >(I'm gonna remove the comments within the code and explain it here...)
-
- >This is shorter, primarily because the code itself (not the declarations)
- >is entirely in assembly. Therefore, it can be used by any Pascal
- >or C compiler with an inline assembler.
-
- >This is also slightly faster and smaller, since I killed possible
- >inefficiencies in the compiled code. The original routine was designed
- >for speed, not for size, so I added an option that compiles a MUCH
- >smaller (but slower) version:
-
- Thanks for the improvements; unfortunately they were made to the first
- version. The second one I posted is over 3 times faster. It compiles
- to about 146 bytes (as opposed to 84 bytes for version 1), so you may
- want to re-optimize it for size if the extra space is bothersome.
-
- - --mg
-
- p.s., here's the benchmarks of my functions. Please don't make fun of my
- slow 7.x MHz CPU :-)
-
- Function 0 is Symantec's memset;
- 1 is my first BlockZero;
- 2 is the new BlockZero.
- 3 is Michael Hecht's wzbzeri;
-
- Function 0: 7 bytes in 89 ticks 59.91 KB/sec
- Function 1: 7 bytes in 112 ticks 47.61 KB/sec
- Function 2: 7 bytes in 162 ticks 32.91 KB/sec
- Function 3: 7 bytes in 170 ticks 31.36 KB/sec
-
- Function 0: 99 bytes in 464 ticks 162.52 KB/sec
- Function 1: 99 bytes in 216 ticks 349.12 KB/sec
- Function 2: 99 bytes in 194 ticks 388.71 KB/sec
- Function 3: 99 bytes in 261 ticks 288.93 KB/sec
-
- Function 0: 335 bytes in 1423 ticks 179.32 KB/sec
- Function 1: 335 bytes in 495 ticks 515.51 KB/sec
- Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
- Function 3: 335 bytes in 460 ticks 554.73 KB/sec
-
- Function 0: 2048 bytes in 8154 ticks 191.32 KB/sec
- Function 1: 2048 bytes in 2499 ticks 624.25 KB/sec
- Function 2: 2048 bytes in 910 ticks 1714.29 KB/sec
- Function 3: 2048 bytes in 1291 ticks 1208.37 KB/sec
-
- Function 0: 308294 bytes in 1243151 ticks 188.90 KB/sec
- Function 1: 308294 bytes in 358464 ticks 655.11 KB/sec
- Function 2: 308294 bytes in 114221 ticks 2055.96 KB/sec
- Function 3: 308294 bytes in 170072 ticks 1380.79 KB/sec
-
- - --
- - --mg mgleason@cse.unl.edu
-
- +++++++++++++++++++++++++++
-
- From: keith@taligent.com (Keith Rollin)
- Date: 21 Sep 92 18:17:57 GMT
- Organization: Taligent
-
- In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu
- (Mike Gleason) wrote:
- >
- > p.s., here's the benchmarks of my functions. Please don't make fun of my
- > slow 7.x MHz CPU :-)
- >
- > Function 0 is Symantec's memset;
- > 1 is my first BlockZero;
- > 2 is the new BlockZero.
- > 3 is Michael Hecht's wzbzeri;
- >
- > Function 2: 7 bytes in 162 ticks 32.91 KB/sec
- > Function 2: 99 bytes in 194 ticks 388.71 KB/sec
- > Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
- >
-
- I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU.
-
- NEARLY 3 SECONDS TO CLEAR 7 BYTES???
-
- Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only
- 20% longer to clear more than 12 times that much? Or 194 units to clear 99
- bytes and only 10 units more to clear over 3 times that much? The
- overhead's gotta be killer.
-
- And the math doesn't work... (7 bytes / 162 ticks) * (60 ticks / 1 second)
- comes to .00259 KB/second. You'd have to assume Mike ran 12,693 iterations
- in order to get the tickcount he did on that first test.
-
- - -----
- Keith Rollin
- Phantom Programmer
- Taligent, Inc.
-
- +++++++++++++++++++++++++++
-
- From: mgleason@cse.unl.edu (Mike Gleason)
- Date: 21 Sep 92 22:36:09 GMT
- Organization: NCEMRSoft
-
- keith@taligent.com (Keith Rollin) writes:
-
- >In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu
- >(Mike Gleason) wrote:
- >>
- >> p.s., here's the benchmarks of my functions. Please don't make fun of my
- >> slow 7.x MHz CPU :-)
- >>
- >> Function 0 is Symantec's memset;
- >> 1 is my first BlockZero;
- >> 2 is the new BlockZero.
- >> 3 is Michael Hecht's wzbzeri;
- >>
- >> Function 2: 7 bytes in 162 ticks 32.91 KB/sec
- >> Function 2: 99 bytes in 194 ticks 388.71 KB/sec
- >> Function 2: 335 bytes in 204 ticks 1250.86 KB/sec
- >>
-
- >I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU.
-
- >NEARLY 3 SECONDS TO CLEAR 7 BYTES???
-
- :-) I evidently forgot to mention that I was using VIA_ticks() that comes
- with Symantec's ANSI library. According to them there are "approximately
- 780,000" per second. Originally my test suite was using system Ticks, but
- 60 units per second isn't enough so I decided to try using the VIA timer
- that Symantec's profiler uses.
-
- >Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only
- >20% longer to clear more than 12 times that much? Or 194 units to clear 99
- >bytes and only 10 units more to clear over 3 times that much? The
- >overhead's gotta be killer.
-
- There is a bit of overhead. Here's a bit of pseudo-code to illustrate it:
- (Did the second version of the code get through? You apparently haven't
- seen it and I have had an email request. I guess I will just append it
- again at the end of this message; sorry if this is wasting bandwidth.)
-
- make a copy of 'n' for the counter;
- make a copy of the pointer to the area to be zeroed;
- align our ptr on a longword boundary;
- if (n >= 144) then
- zero 144 at a time, subtract amount zeroed from n;
- if (n >= 16) then
- zero 16 a time, subtract amount zeroed from n;
- if (n >= 1) then
- zero 1 at a time.
-
- If you look at the results for Symantec's memset(), which is just one byte
- at a time, it doesn't do much better for small sizes, despite having little
- overhead. It just looks like mine has a lot of overhead because it does
- so much better on big blocks.
-
- Function 0: (memset) 7 bytes in 89 ticks 59.91 KB/sec
- Function 0: 99 bytes in 464 ticks 162.52 KB/sec
- Function 0: 335 bytes in 1423 ticks 179.32 KB/sec
-
- - --mg
-
- (Version 2 of BlockZero follows...)
-
- /* BlockZero.c */
-
- /* Handy macro: */
- #define ZERO(block,size) BlockZero(&(block), (long) (size))
-
- void BlockZero(void *area, long n);
-
- #define PTR a0 /* Our copy of 'area.' */
- #define COUNT d0 /* Our copy of 'n.' */
-
- /*
- * Using registers d1 - d7, and a1 - a5 to hold zeroes for use
- * with MOVEM. Could use a6 also if you can trick the inline
- * assembler to not use LINK/UNLK.
- */
- #define nAddrRegsUsed 5
- #define nDataRegsUsed 7
-
- /*
- * One MOVEM instruction will clear this many bytes at a time.
- */
- #define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long))
-
- /*
- * This is how many MOVEM instructions you want in your loop. For each
- * extra one you add, then large blocks will clear faster; however
- * smaller blocks will be slower because it will use the intermediate
- * loop, where only 16 bytes are cleared at a time.
- */
- #define UnrolledLoops 3
-
- /*
- * For each UnrolledLoop, do one of these.
- */
- #define Loop movem.l d1-d7/a1-a5, -(PTR)
-
- /*
- * This is how many bytes are cleared by all of your MOVEMs before
- * we have to decrement the count and start another loop iteration.
- */
- #define TotalSize (ZSize * UnrolledLoops)
-
- void BlockZero(void *area, long n)
- {
- asm {
- movem.l d0-d7/a0-a5, -(sp)
-
- move.l n, COUNT ; Could use index from the SP
- movea.l area, PTR ; for these instead of a6.
-
- add.l COUNT, PTR ; Start at end of block,
- ; and go backwards.
-
- move.w PTR, d1 ; Find (area + n) % 4.
- andi.w #3, d1
-
- sub.l d1, COUNT ; We will byte-zero this
- ; many bytes to longword align,
- ; so subtract this from the
- ; total.
-
- moveq.l #0, d2
-
- bra @2 ; Zero the last 3 or so
- @1: move.b d2, -(PTR) ; bytes to align PTR on a
- @2: dbra d1, @1 ; longword boundary.
-
- ; Clear out as many registers
- ; as possible to use with MOVEM.
-
- move.l d2, d1 ; Don't need d1 for scratch.
- move.l d2, d3
- move.l d2, d4
-
- cmpi.l #TotalSize, COUNT
- bls @6 ; If it is a smaller block, then
- ; jump to the section that clears
- ; 16 at a time.
-
- ; Otherwise we'll need to clear
- ; out the rest of these registers
- ; to use in a huge MOVEM.
- move.l d2, d5
- move.l d2, d6
- move.l d2, d7
- movea.l d2, a1
- movea.l a1, a2
- movea.l a1, a3
- movea.l a1, a4
- movea.l a1, a5
-
- ; Now zero as much as we can
- ; in blocks of 144 (3 movem's
- ; of 48 bytes each).
-
- bra @4
- @3: Loop ; Add more Loops here if you
- Loop ; like, but adjust the #defines
- Loop ; above.
- @4: subi.l #TotalSize, COUNT
- bge @3
-
- ; The remaining block is now
- ; less than TotalSize, so now
- ; zero as much as we can
- ; in blocks of 16.
-
- addi.l #TotalSize, COUNT
-
- bra @6
- @5: movem.l d1-d4, -(PTR)
- @6: subi.l #16, COUNT
- bge @5
- ; The remaining block is now
- ; less than 16, so now
- ; use byte-zeroing.
-
- addi.l #16, COUNT
- bra @8
- @7: move.b d2, -(PTR)
- @8: dbra COUNT, @7
- ; Restore nuked registers.
- movem.l (sp)+,d0-d7/a0-a5
- }
- } /* BlockZero */
-
- /* eof BlockZero.c */
- - --
- /*
- * Mike Gleason, NCEMRSoft Worldwide Development. mgleason@cse.unl.edu
- * Patriots fan, and damn proud of it! "Bienvenido a Macintosh" --7.0e1
- */
-
- +++++++++++++++++++++++++++
-
- From: wombat@claris.com (Scott Lindsey)
- Date: 21 Sep 92 23:13:23 GMT
- Organization: Claris Corporation, Vancouver WA
-
- In article <D88-JWA.92Sep18144437@dront.nada.kth.se>, d88-jwa@dront.nada.kth.se (Jon Wtte) writes:
- >
- >
- > How about MOVEQ.L?
- >
- On the 68000, MOVEQ can only move 8 bits into a data register; the 8 bits are
- sign-extended into a long.
-
- MOVEQ's timing is 8(2/0); 4(1/0) on an '010.
-
- - --
- Scott Lindsey <wombat@claris.com>
-
- +++++++++++++++++++++++++++
-
- From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University)
- Date: 23 Sep 92 06:08:56 GMT
- Organization: University of Waikato, Hamilton, New Zealand
-
- In article <1992Sep21.223609.4608@unlinfo.unl.edu>, mgleason@cse.unl.edu (Mike Gleason) writes:
-
- > :-) I evidently forgot to mention that I was using VIA_ticks() that comes
- > with Symantec's ANSI library. According to them there are "approximately
- > 780,000" per second. Originally my test suite was using system Ticks, but
- > 60 units per second isn't enough so I decided to try using the VIA timer
- > that Symantec's profiler uses.
-
- Why not just use the Time Manager? You can get microsecond accuracy (none
- of this "approximately 780,000" business). Details in IM6.
-
- Lawrence D'Oliveiro fone: +64-7-856-2889
- Computer Services Dept fax: +64-7-838-4066
- University of Waikato electric mail: ldo@waikato.ac.nz
- Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
-
- ---------------------------
-
- End of C.S.M.P. Digest
- **********************
-